ABSTRACT

This short report addresses the question of using the NYU Ultracomputer
for the generation of prime numbers.

1.  Introduction 

We wish to find all prime numbers between 2 and N. The classic algorithm in this area
is the sieve of Eratosthenes, named after the third century B.C. Greek astronomer and
geographer. The algorithm works by removing non-primes from the set { 2, ..., N } using only
additions. Every non-prime is found once for each of its prime factors not greater than
$\sqrt{N}$, resulting in an arithmetic complexity of $\Theta(N \log\log N)$ additions [1]. The space
complexity of this algorithm is $\Theta(N)$. To implement Eratosthenes' sieve it suffices
to use the simplest data structure, namely an array.

In the last several years, some linear sieve algorithms have been found [1,2,3]. The
main feature of these algorithms is that each non-prime is found just once. However,
these algorithms have some defects for our purposes. First, they are essentially
sequential; in other words, each step uses all the information produced by the preceding steps.
Second, they are much more complicated to program, since they demand a more
complicated data structure, such as a linked list.

We chose to write a parallel sieve program as an exercise.

2.  The  Problem  of  Saving  Memory 

The  other  merit  of  Eratosthenes'  sieve  is  the  possibility  it  offers  for  saving  space,  which 
is  a  most  important  issue.  A  program  using  the  classic  Eratosthenes'  algorithm  exhausts  all 
available  memory  of  our  prototype  Ultra  in  several  seconds. 

For the classic Eratosthenes' sieve we use a one-dimensional array A[i] (i from 2 to
N) of logical variables. Initially we set all entries to "TRUE". Then we take, in
increasing order, each j from 2 to N for which A[j] = "TRUE" (j is prime) and mark off
all larger multiples of j. In other words, we set A[i] to "FALSE" for all
i = 2j, 3j, ... less than or equal to N.

The numbers i for which A[i] is still "TRUE" at the end are exactly the primes.
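The procedure just described can be sketched in Python as follows. This is only an illustrative sketch: the function name `eratosthenes` and the use of a Python list for the array A are assumptions made here, not part of the original program.

```python
def eratosthenes(n):
    """Return all primes in [2, n] with the array-based sieve described above."""
    if n < 2:
        return []
    # a[i] plays the role of A[i]; indices 0 and 1 are unused padding.
    a = [True] * (n + 1)
    for j in range(2, n + 1):
        if a[j]:  # j was never marked off, so it is prime
            # Mark off 2j, 3j, ... <= n, stepping by additions only.
            for i in range(2 * j, n + 1, j):
                a[i] = False
    return [i for i in range(2, n + 1) if a[i]]


print(eratosthenes(30))
```

The array holds one entry per candidate, which is the source of the memory pressure discussed in this section.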

Memory can be saved by breaking the whole array A[i] into zones of equal length n
and using three arrays: A[i] (for i from 1 to n) of logical variables; P[j] (for j
from 1 to $\varphi(\sqrt{N})$ *), which contains the primes from 2 to $\sqrt{N}$; and MP[j]
(for j from 1 to $\varphi(\sqrt{N})$), which contains the positions from which we shall begin
removing multiples of P[j] in the next zone. Now we can divide all the work into several
steps:

1. finding all primes up to $\sqrt{N}$ using the classic Eratosthenes' sieve, putting them into
array P[j] and printing them out;

2. setting all MP[j] to be equal to 0;

3. setting all A[i] for i from 1 to n to be "TRUE";

4. for every j, setting A[i] for i = MP[j], MP[j] + P[j], MP[j] + 2*P[j], ... ≤ n to be
"FALSE", and afterwards setting MP[j] to (MP[j] + k*P[j]) - n, where k is the
minimum integer such that (MP[j] + k*P[j]) > n (after this update every MP[j] is
greater than zero and less than or equal to P[j]);

5. printing out all primes from n * (C-1) to n * C, where C is the number of the current
pass through the loop (if A[i] is still "TRUE" at the end, then n * (C-1) + i is prime);

6. returning to step 3.
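The six steps above can be sketched in Python as follows. This is a hedged illustration rather than the original program: the function name `segmented_primes` is hypothetical, index 0 of A is treated as unused padding so that the initial value MP[j] = 0 is harmless, the number 1 is excluded explicitly, and the loop stops once the zones cover N (a termination rule the steps leave implicit).

```python
import math


def segmented_primes(N, n):
    """Primes in [2, N], sieving zones of length n (hypothetical helper)."""
    root = math.isqrt(N)
    # Step 1: primes up to sqrt(N) by the classic sieve, stored in P.
    base = [True] * (root + 1)
    P = []
    for j in range(2, root + 1):
        if base[j]:
            P.append(j)
            for i in range(2 * j, root + 1, j):
                base[i] = False
    primes = list(P)                 # stands in for "printing them out"
    MP = [0] * len(P)                # Step 2
    zones = -(-N // n)               # ceil(N / n): assumed stopping rule
    for c in range(1, zones + 1):
        A = [True] * (n + 1)         # Step 3 (A[0] is padding)
        for j, p in enumerate(P):    # Step 4
            i = MP[j]
            while i <= n:
                A[i] = False
                i += p
            MP[j] = i - n            # first multiple, relative to next zone
        offset = n * (c - 1)         # Step 5
        for i in range(1, n + 1):
            value = offset + i
            # Primes up to sqrt(N) were marked off as multiples of themselves
            # and are already in the list; 1 is excluded explicitly.
            if A[i] and 1 < value <= N:
                primes.append(value)
    return primes


print(segmented_primes(100, 7))
```

Only the arrays A (length n), P and MP are live at any time, which is the point of the zone scheme.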

The best achievable space complexity is $\Theta(\sqrt{N} / \log\sqrt{N})$, because $\varphi(\sqrt{N})$ is
approximately $\sqrt{N} / \log\sqrt{N}$ [1].

3.  Parallelization 

We  use  the  fact  that  the  Ultra  supports  concurrent  stores. 

The  main  principle  of  parallelization  lies  in  using  several  different  integers  not  yet 
shown  to  be  nonprimes  to  mark  off  elements  A[i]  which  are  multiples  of  them. 

For synchronization we use only the function FAA (Fetch & Add).

The schemes of the parallel algorithms for the classic Eratosthenes' sieve and for the
improved, memory-saving one are shown in Fig. 1 and Fig. 2.
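The parallel scheme for the classic sieve can be sketched in Python as follows. This is only an illustration under stated assumptions: the class `FetchAndAdd` emulates the hardware FAA primitive with a lock, the worker threads stand in for processors, and stopping at j > sqrt(N) is an assumption made here. Because every worker only ever stores FALSE, concurrent stores to the same entry are harmless.

```python
import math
import threading


class FetchAndAdd:
    """Software stand-in for the FAA primitive (an assumption of this sketch)."""

    def __init__(self, value):
        self._value = value
        self._lock = threading.Lock()

    def faa(self, increment):
        # Atomically return the old value and add the increment.
        with self._lock:
            old = self._value
            self._value += increment
            return old


def parallel_sieve(N, workers=4):
    A = [True] * (N + 1)
    sem = FetchAndAdd(2)              # SEM := 2
    limit = math.isqrt(N)

    def worker():
        while True:
            j = sem.faa(1)            # j := FAA(&SEM, 1)
            if j > limit:
                return
            if A[j]:                  # j not yet shown to be a nonprime
                i = 2 * j
                while i <= N:
                    A[i] = False      # concurrent store of the same value
                    i += j

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return [i for i in range(2, N + 1) if A[i]]


print(parallel_sieve(50))
```

A worker may pick a nonprime j before it has been marked off; this costs extra work but never marks a prime, so the result does not depend on the interleaving. In CPython the threads do not run truly in parallel, so the sketch shows the synchronization pattern, not a speedup.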

4.  Time  Complexity  of  Serial  Algorithms 

The time complexity of Eratosthenes' sieve can be calculated by counting the
marking-off operations applied to nonprimes.


* $\varphi(n)$, here and below, denotes the number of primes less than or equal to n.

Approximating the i-th prime as $i \log i$, and $\varphi(n)$ as $n / \log n$ [1], we have

$$
C(N) = \sum_{i=2}^{\varphi(\sqrt{N})} \frac{N}{i \log i}
\approx N \sum_{i=2}^{\sqrt{N}/\log\sqrt{N}} \frac{1}{i \log i}
\approx N \int_{2}^{\sqrt{N}/\log\sqrt{N}} \frac{dx}{x \log x}
= N \log\log x \,\Big|_{2}^{\sqrt{N}/\log\sqrt{N}}
$$

$$
= N \log\log\left[ \frac{\sqrt{N}}{\log\sqrt{N}} \right] - N \log\log 2
= N \log\left( \log\sqrt{N} - \log\log\sqrt{N} \right) - N \log\log 2
\approx N \log\log\sqrt{N} = \Theta(N \log\log N)
$$
The time complexity of the memory-saving algorithm is

$$
C(N) + \varphi(\sqrt{N}) \cdot \frac{N}{n},
$$

where C(N) is the time complexity of Eratosthenes' algorithm and $\varphi(\sqrt{N}) \cdot N/n$ is the
number of operations necessary for updating the array MP.

But $\varphi(\sqrt{N}) \cdot N/n < N$, because $n > \varphi(\sqrt{N})$.

Thus, the time complexity of the memory-saving algorithm is the same as for the classic
Eratosthenes' sieve: $\Theta(N \log\log N)$.
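The estimate can be checked empirically by counting the marking operations directly. The sketch below is an illustration only: the helper name `count_markings` is hypothetical, natural logarithms are assumed, and only primes up to sqrt(N) are used for marking, in line with the derivation in this section. If the estimate holds, the printed ratio should stay within a constant factor as N grows.

```python
import math


def count_markings(N):
    """Count the stores A[i] := FALSE made when sieving with primes <= sqrt(N)."""
    A = [True] * (N + 1)
    ops = 0
    for j in range(2, math.isqrt(N) + 1):
        if A[j]:
            for i in range(2 * j, N + 1, j):
                A[i] = False
                ops += 1
    return ops


for N in (10**3, 10**4, 10**5):
    ratio = count_markings(N) / (N * math.log(math.log(N)))
    print(N, round(ratio, 3))
```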

5.   Time  Complexity  of  Parallel  Algorithm 

From the very essence of the parallel algorithm it follows that for a small number of
processors P the work can be divided completely among the processors. Assuming that the
speeds of the different processors are nearly the same, it is easily seen that the additional
work caused by checking divisibility by nonprimes takes place only at the first stage, so it is
relatively small. Therefore the time complexity of the parallel algorithm is

$$
C_{par} = \frac{C}{P},
$$

where P is the number of processors.

For a large number of processors it is N/2 (all other processors wait until the processor
which is responsible for divisibility by 2 finishes its work).

The calculation below is an attempt to establish the relationship between
the maximum number of useful processors and the size of the problem N.

The key inequality is

$$
\frac{C}{P} \ge \frac{N}{2}.
$$

It means that the work done by one processor is greater than or equal to
the work done by the processor that deals with divisibility by 2.

So, from the key inequality and $C \approx N \log\log\sqrt{N}$, we have

$$
P \le 2 \log\log\sqrt{N} = \Theta(\log\log N).
$$

The relationship between the size of the problem N and the function $2 \log\log\sqrt{N}$ is shown
in Fig. 3.
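The bound is easy to evaluate numerically. The short sketch below assumes natural logarithms, which appear consistent with the values listed in Fig. 3 when truncated to one decimal place; the function name `critical_processors` is hypothetical.

```python
import math


def critical_processors(N):
    """Evaluate 2 * log(log(sqrt(N))) using natural logarithms (an assumption)."""
    return 2.0 * math.log(math.log(math.sqrt(N)))


for N in (10**2, 10**4, 10**6, 10**8, 10**10):
    print(N, critical_processors(N))
```

Even for N of ten billion the bound stays below five processors, which shows how slowly it grows.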

The  table  of  theoretical  time  complexity  is  represented  in  FIG. 4. 

The  results  of  timing  experiments  on  the  Ultracomputer  prototype  for  N  =  5,000,000 
and  different  numbers  of  processors  are  represented  in  FIG.  6. 

The fact that the speed continues to grow somewhat beyond the critical number of
processors can be explained by the presence in our program of other subprograms besides
the sieve of Eratosthenes itself. Subprograms such as setting all entries of the arrays to zero
and counting the primes after the sieve can be parallelized perfectly.

6.   Possible  Other  Parallel  Algorithms 

At the beginning of this paper we wrote that some new linear algorithms are more
difficult to parallelize than Eratosthenes' sieve.

As we showed above, the sieve of Eratosthenes can be easily parallelized, but the
number of processors it can use effectively is relatively small.

Let us try to get a more efficient parallel algorithm for finding prime numbers by
further modifying the serial one.

a) First let us consider the algorithm which marks off each nonprime K by each of its
divisors less than or equal to $\sqrt{K}$, whether they are primes or not. For this, as for
Eratosthenes' sieve, we use a one-dimensional array A[i] (i from 2 to N) of logical
variables. Initially, we set all entries to "TRUE". Then we take each j from 2 to
$\sqrt{N}$ in turn and mark off all numbers i equal to $j^2$, $j^2 + j$, $j^2 + 2j$, ... up to N.

The time complexity of such an algorithm is

$$
\sum_{j=2}^{\sqrt{N}} \left[ \frac{N - j^2}{j} + \alpha \right],
$$

where $\alpha$ = const is the time needed to compute $j^2$. This gives

$$
\approx N \log x \,\Big|_{2}^{\sqrt{N}} - \frac{N}{2} + \alpha\sqrt{N}
\approx N \log\sqrt{N} = \Theta(N \log N)
$$

b) Algorithm a) marks off each nonprime number K by each of its divisors less
than or equal to $\sqrt{K}$. Algorithm b) is its mirror image: it marks off each nonprime K
by each of its divisors equal to or greater than $\sqrt{K}$.

In other words, we use the array A[i] (i from 2 to N) and set all entries to "TRUE".
Then we take each j from 2 to N/2 and mark off all numbers i = 2j, 3j, ... up to $j^2$ or
N, whichever is less.

The time complexity of this algorithm must be the same as for algorithm a),
because any number K has as many divisors less than $\sqrt{K}$ as it has divisors greater
than $\sqrt{K}$.

But let us calculate it:


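$$
\int_{2}^{\sqrt{N}} x \, dx + \int_{\sqrt{N}}^{N/2} \frac{N}{x} \, dx
\approx \frac{N}{2} + N \log x \,\Big|_{\sqrt{N}}^{N/2}
= \frac{N}{2} + N \left[ \log\frac{N}{2} - \log\sqrt{N} \right]
$$

$$
\approx N \log\left( \frac{\sqrt{N}}{2} \right) = \Theta(N \log N)
$$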
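Algorithms a) and b) can be compared directly with a small Python sketch. The function names and the operation counters are assumptions introduced here for illustration. Each marking corresponds to a pair (j, m) with j * m not exceeding N: algorithm a) visits the pairs with j not exceeding m, and algorithm b) visits the pairs with m not exceeding j, so the two counts should coincide.

```python
def sieve_by_small_divisors(N):
    """Algorithm a): mark K by every divisor j with j * j <= K."""
    A = [True] * (N + 1)
    ops = 0
    j = 2
    while j * j <= N:
        for i in range(j * j, N + 1, j):   # j^2, j^2 + j, ... <= N
            A[i] = False
            ops += 1
        j += 1
    return [i for i in range(2, N + 1) if A[i]], ops


def sieve_by_large_divisors(N):
    """Algorithm b): mark K by every divisor j with j * j >= K."""
    A = [True] * (N + 1)
    ops = 0
    for j in range(2, N // 2 + 1):
        # 2j, 3j, ... up to j^2 or N, whichever is less
        for i in range(2 * j, min(j * j, N) + 1, j):
            A[i] = False
            ops += 1
    return [i for i in range(2, N + 1) if A[i]], ops


primes_a, ops_a = sieve_by_small_divisors(1000)
primes_b, ops_b = sieve_by_large_divisors(1000)
print(primes_a == primes_b, ops_a == ops_b)
```

In algorithm b) no single j marks more than about sqrt(N) entries, whereas in Eratosthenes' sieve j = 2 alone marks about N/2.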


The last algorithm can be parallelized in the same way as Eratosthenes' sieve. If we use
a small number of processors P, its time complexity is

$$
\frac{N \log N}{P} \qquad \{1\}
$$

and if we use a large number of processors it is

$$
\sqrt{N} \qquad \{2\}
$$

because the biggest job is done by the processor which marks off
the numbers divisible by $\sqrt{N}$.

Moreover, the bound on the number of useful processors, which is the number P for
which formula {1} and formula {2} are equal, is much higher than for Eratosthenes' sieve. It
can be evaluated from the equation

$$
\sqrt{N} = \frac{N \log N}{P}.
$$

So:

$$
P = \Theta\left( \sqrt{N} \log N \right)
$$

The  time  complexity  of  parallelization  of  the  last  algorithm  is  shown  in  FIG.  5. 

7.  Conclusion 

Both  the  classic  algorithm  and  the  improved  one  for  saving  memory  were  carried  out 
on  the  NYU  Ultracomputer.  Results  confirmed  the  theoretical  complexity  estimates. 

The program continues to be used for testing the Ultracomputer.
[Fig. 1 flowchart. Legible boxes, in order: PRIMES; SEM := 2; A[i] := TRUE (for i from 2 to N);
j := FAA(&SEM, 1); decision boxes with TRUE and FALSE branches; i := 2*j; A[i] := FALSE;
i := i + j; OUTPUT; END.]

Fig. 1 The scheme of the parallel algorithm for the realization of the classic
Eratosthenes' sieve.
[Fig. 2 flowchart. Legible boxes, in order: PRIMES($\sqrt{N}$) -> P[j] (for j from 1 to $\varphi(\sqrt{N})$);
OUTPUT; c := 0; MP[j] := 0 (for j from 1 to $\varphi(\sqrt{N})$); c := c + 1; SEM := 1;
A[i] := TRUE (for i from 1 to n); j := FAA(&SEM, 1); i := MP[j]; test i ≤ n; MP[j] := i - n; END.
Decision boxes have TRUE and FALSE branches.]

Fig. 2 The scheme of the parallel algorithm for the realization of the improved
sieve for saving memory.
N                     $P = 2 \log\log\sqrt{N}$

100                   1.6
10,000                3.0
1,000,000             3.8
hundred million       4.4
ten billion           4.8

Fig. 3 The relationship between the size of the problem N
and the critical number of processors.


NUMBER OF PROCESSORS          TIME COMPLEXITY $C_P$

P = 1                         $C_1 = \Theta(N \log\log N)$
$P \le P_{critical}$          $C_1 / P$
$P \ge P_{critical}$          $\Theta(N)$

Fig. 4 The table of theoretical time complexity of the parallel Eratosthenes'
sieve algorithm.


NUMBER OF PROCESSORS                              TIME COMPLEXITY $C_P$

P = 1                                             $C_1 = \Theta(N \log N)$
$P < P_{crit} = \Theta(\sqrt{N} \log N)$          $C_1 / P$
$P > P_{crit}$                                    $\Theta(\sqrt{N})$

Fig. 5 The table of theoretical time complexity of the modified algorithm.
Fig. 6 The results of timing experiments on the Ultracomputer prototype
for N = 5,000,000 and different numbers of processors.